Red Wine Data Analysis by Sourabh Dev

Univariate Plots Section

Let's start by looking at the data summary.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
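The two printouts above come from str() and summary(). As a minimal sketch, the code below rebuilds only the first ten observations shown by str() and runs the same two calls on them. Density is left out because only five of its values are printed, and wine_head is a name chosen here for illustration, not the data frame used in this report.

```r
# First ten observations, copied from the str() output above.
# `wine_head` is an illustrative name (assumption), not the author's object.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, median, mean and max per column
```

With the full file, the same two calls would be applied to the data frame returned by read.csv(); the file name is not given in this report.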

From this summary, the variables fall into a few broad categories: acidity, sugar, other chemical compounds, quality, and alcohol content.

Let's start by plotting the quality. The distribution looks roughly normal, with most wines scoring 5 or 6.
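The shape of the quality distribution can also be checked numerically. The counts in this sketch are the group sizes (the n column) printed in the grouped alcohol statistics of the bivariate section; the variable names are illustrative.

```r
# Number of wines per quality score, copied from the n column of the
# grouped statistics in the bivariate section. Names are illustrative.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)             # should match the 1599 observations
share <- quality_counts / total          # proportion of wines at each score
share_5_or_6 <- sum(share[c("5", "6")])  # proportion rated 5 or 6

print(round(100 * share, 1))
print(round(100 * share_5_or_6, 1))
```

Because quality is an integer score, a bar chart of these counts, for example barplot(quality_counts), is a reasonable alternative to a histogram.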

Next, let's look at density, alcohol level, and residual sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Density looks approximately normal, while alcohol is somewhat right-skewed (mean 10.42, median 10.20), with a large spike around 9.5%.

Residual sugar is heavily right-skewed (median 2.2, maximum 15.5), so it makes sense to replot it on a log scale.
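The effect of a log transform on skew can be sketched with a scale-free skewness measure. The code below uses only the first ten residual sugar values from the str() output, so the numbers illustrate the idea rather than reproduce the full-data plot; the skewness helper is written here for illustration (moment-based, third central moment divided by the second to the power 1.5).

```r
# Residual sugar for the first ten wines, copied from the str() output.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness; positive values mean a longer right tail.
# Illustrative helper, not a function used in the report.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(sugar)         # skewness on the original scale
skew_log <- skewness(log10(sugar))  # skewness after a log10 transform

print(c(raw = skew_raw, log10 = skew_log))

# hist(log10(sugar)) would draw the transformed distribution.
```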

Nothing notable stands out on the log scale either.

Now, let's look at the acidity.

pH seems to follow a normal distribution, with the largest concentration around 3.3.

Fixed and volatile acidity both look skewed, but no clear pattern is visible in the citric acid levels, so let's explore citric acid further.

Citric acid looks skewed when plotted on a log scale.
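One caveat for citric acid on a log scale: the summary shows a minimum of 0, and log10(0) is -Inf in R, so zero values cannot be placed on a log axis. The sketch below uses the first ten citric acid values from the str() output. Adding a small offset (0.01 here) is one common workaround and is an assumption, not necessarily what was done for the plot in this report.

```r
# Citric acid for the first ten wines, copied from the str() output.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero <- sum(citric == 0)               # wines with no citric acid
log_raw <- log10(citric)                 # zeros become -Inf
n_infinite <- sum(is.infinite(log_raw))  # values lost on a log axis

# Assumed workaround: shift by a small constant before taking the log.
offset <- 0.01
log_shifted <- log10(citric + offset)

print(c(zeros = n_zero, infinite = n_infinite))
print(round(log_shifted, 2))
```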

Finally, let's explore the chemical levels.

These plots look like normal distributions if we remove the outliers.

Both distributions are skewed.

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset, described by 13 variables: a row index (X) and 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”).

Some interesting observations:

* The majority of the wines are rated 5 or 6 in quality.
* The alcohol levels are skewed, with a large spike at 9.5%.
* The median pH is 3.31.

What is/are the main feature(s) of interest in your dataset?

The main feature in this dataset is the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The supporting features of interest are citric.acid, residual.sugar, pH and alcohol. It would be interesting to see how these variables affect the quality.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric acid and alcohol look a little unusual. Alcohol has a skewed distribution with a sudden dip, so it looks almost bimodal, while citric acid is skewed when the x axis is on a log scale.

No additional changes were made.

Bivariate Plots Section

##                  fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity       1.00000000     -0.256130895   0.6717034    0.114776724
## volatile.acidity   -0.25613089      1.000000000  -0.5524957    0.001917882
## citric.acid         0.67170343     -0.552495685   1.0000000    0.143577162
## residual.sugar      0.11477672      0.001917882   0.1435772    1.000000000
## density             0.66804729      0.022026232   0.3649472    0.355283371
## pH                 -0.68297819      0.234937294  -0.5419041   -0.085652422
## alcohol            -0.06166827     -0.202288027   0.1099032    0.042075437
## quality             0.12405165     -0.390557780   0.2263725    0.013731637
##                      density          pH     alcohol     quality
## fixed.acidity     0.66804729 -0.68297819 -0.06166827  0.12405165
## volatile.acidity  0.02202623  0.23493729 -0.20228803 -0.39055778
## citric.acid       0.36494718 -0.54190414  0.10990325  0.22637251
## residual.sugar    0.35528337 -0.08565242  0.04207544  0.01373164
## density           1.00000000 -0.34169933 -0.49617977 -0.17491923
## pH               -0.34169933  1.00000000  0.20563251 -0.05773139
## alcohol          -0.49617977  0.20563251  1.00000000  0.47616632
## quality          -0.17491923 -0.05773139  0.47616632  1.00000000
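A matrix like the one above is what cor() returns for a set of numeric columns. As a sketch, the code below runs it on three columns of the first ten observations from the str() output, so its coefficients will differ from the full-data values printed above; the data frame name is illustrative.

```r
# Three columns of the first ten observations, copied from the str() output.
# `wine_head` is an illustrative name (assumption).
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations (the default method of cor()).
cor_head <- cor(wine_head)

print(round(cor_head, 2))
```

The result is symmetric with ones on the diagonal, which is why the printed table repeats each coefficient twice.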

Let's draw a correlation plot to get a better understanding.

From the above table and plot matrix we see that “fixed.acidity” (0.67), “volatile.acidity” (-0.55) and “pH” (-0.54) are correlated with “citric.acid”. Interestingly, density is correlated with “fixed.acidity” (0.67) and “alcohol” (-0.50). Also, “quality” is correlated with “alcohol” (0.48).

Let's now look at pH, fixed.acidity and volatile.acidity versus citric.acid.

## 
## Call:
## lm(formula = citric.acid ~ pH, data = analysis_winedata)
## 
## Coefficients:
## (Intercept)           pH  
##      2.5350      -0.6838

The scatter plot shows a moderate negative correlation (r of about -0.54).
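To read the fitted coefficients, the line can be evaluated by hand. The sketch below plugs pH values into the printed intercept and slope; predict_citric is a helper name introduced here for illustration. A least-squares line passes through the point of means, so the prediction at the mean pH (3.311) should land close to the mean citric acid (0.271) from the summary.

```r
# Coefficients copied from the lm(citric.acid ~ pH) output above.
intercept <- 2.5350
slope <- -0.6838

# Illustrative helper: fitted citric acid for a given pH.
predict_citric <- function(pH) intercept + slope * pH

at_mean_pH <- predict_citric(3.311)  # compare with the mean citric acid, 0.271
at_low_pH  <- predict_citric(3.0)    # more acidic wine
at_high_pH <- predict_citric(3.6)    # less acidic wine

print(round(c(mean = at_mean_pH, low = at_low_pH, high = at_high_pH), 3))
```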

## 
## Call:
## lm(formula = citric.acid ~ fixed.acidity, data = analysis_winedata)
## 
## Coefficients:
##   (Intercept)  fixed.acidity  
##      -0.35427        0.07515

The scatter plot shows a fairly strong positive correlation (r of about 0.67).

## 
## Call:
## lm(formula = citric.acid ~ volatile.acidity, data = analysis_winedata)
## 
## Coefficients:
##      (Intercept)  volatile.acidity  
##           0.5882           -0.6011

This plot looks very similar to the one for pH vs citric acid, so pH and volatile.acidity may be related. Let's plot them against each other.

## 
## Call:
## lm(formula = pH ~ volatile.acidity, data = analysis_winedata)
## 
## Coefficients:
##      (Intercept)  volatile.acidity  
##           3.2042            0.2026

There does seem to be a correlation here, although it is weak (r of about 0.23).

Now, let's look at density vs alcohol and density vs fixed.acidity.

## 
## Call:
## lm(formula = density ~ alcohol, data = analysis_winedata)
## 
## Coefficients:
## (Intercept)      alcohol  
##   1.0059059   -0.0008788

The general trend is that density decreases as the alcohol level increases. This makes sense: alcohol is less dense than water, so more alcohol and less water means a lower density.
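As a rough check on the size of this effect, the sketch below evaluates the fitted line at the lowest and highest alcohol levels from the summary and squares the density-alcohol correlation from the correlation table. For a linear fit with a single predictor, the squared correlation equals the share of variance explained. The variable names are illustrative.

```r
# Coefficients copied from the lm(density ~ alcohol) output above.
intercept <- 1.0059059
slope <- -0.0008788

# Minimum and maximum alcohol levels from the data summary.
alcohol_range <- c(min = 8.4, max = 14.9)

fitted_density <- intercept + slope * alcohol_range
fitted_drop <- unname(fitted_density["min"] - fitted_density["max"])

# Density-alcohol correlation from the correlation table.
r <- -0.49617977
r_squared <- r^2  # share of density variance explained by alcohol alone

print(round(fitted_density, 4))
print(round(c(drop = fitted_drop, r_squared = r_squared), 4))
```

On these numbers alcohol explains roughly a quarter of the variance in density, which suggests that other components of the wine also matter.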

There is a clear linear relationship between fixed acidity and density (r of about 0.67): fixed acidity goes up with density.

Now, let's move on to the most interesting plot: alcohol versus quality.

## $`3`
##    vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 10 9.96 0.82   9.93   10.02 0.78 8.4  11   2.6 -0.41    -0.99 0.26
## 
## $`4`
##    vars  n  mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 53 10.27 0.93     10   10.21 1.19   9 13.1   4.1 0.61    -0.23
##      se
## X1 0.13
## 
## $`5`
##    vars   n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 681  9.9 0.74    9.7    9.79 0.44 8.5 14.9   6.4 1.83     5.25
##      se
## X1 0.03
## 
## $`6`
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 638 10.63 1.05   10.5   10.56 1.19 8.4  14   5.6 0.54    -0.16
##      se
## X1 0.04
## 
## $`7`
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 199 11.47 0.96   11.5   11.47 1.04 9.2  14   4.8 0.01    -0.47
##      se
## X1 0.07
## 
## $`8`
##    vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 18 12.09 1.22  12.15   12.12 1.19 9.8  14   4.2 -0.2    -0.98 0.29
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)

There seems to be a positive correlation, except in the case of wines rated 5 in quality, whose mean alcohol level (9.9) is lower than that of wines rated 4 (10.27).
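The dip can be located directly from the grouped statistics. The sketch below copies the mean and median alcohol level for each quality score from the output above and looks at the change from one score to the next; the variable names are illustrative.

```r
# Mean and median alcohol by quality score, copied from the grouped output.
mean_alcohol   <- c("3" = 9.96, "4" = 10.27, "5" = 9.90,
                    "6" = 10.63, "7" = 11.47, "8" = 12.09)
median_alcohol <- c("3" = 9.93, "4" = 10.00, "5" = 9.70,
                    "6" = 10.50, "7" = 11.50, "8" = 12.15)

# Change in alcohol level between consecutive quality scores.
step_mean   <- diff(mean_alcohol)
step_median <- diff(median_alcohol)

# Quality score with the lowest average alcohol level.
lowest_mean_score   <- names(which.min(mean_alcohol))
lowest_median_score <- names(which.min(median_alcohol))

print(round(step_mean, 2))
print(round(step_median, 2))
```

With the raw data, group means of this kind can also be computed in base R with tapply(alcohol, quality, mean).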

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Most of the comparisons made with citric acid showed some type of linear relationship.

The comparison between alcohol and density is consistent with the hypothesis that wines with lower alcohol levels contain relatively more water and are therefore denser, since water is denser than alcohol.

Finally, quality and alcohol showed an increasing linear relationship, but there is a sudden dip in the case of wines with quality ‘5’.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

As mentioned above, the dip in quality vs alcohol is very interesting.

What was the strongest relationship you found?

pH and fixed acidity have the strongest correlation (r of about -0.68).
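This ranking can be checked against the correlation table. The sketch below copies the off-diagonal coefficients whose absolute value exceeds 0.39 and sorts them by strength; the labels and variable names are illustrative.

```r
# Off-diagonal coefficients with |r| > 0.39, copied from the correlation table.
pairs_r <- c(
  "fixed.acidity vs pH"             = -0.68297819,
  "fixed.acidity vs citric.acid"    =  0.67170343,
  "fixed.acidity vs density"        =  0.66804729,
  "volatile.acidity vs citric.acid" = -0.5524957,
  "citric.acid vs pH"               = -0.54190414,
  "density vs alcohol"              = -0.49617977,
  "alcohol vs quality"              =  0.47616632,
  "volatile.acidity vs quality"     = -0.39055778
)

# Rank by absolute value, since the sign only gives the direction.
ranked <- pairs_r[order(abs(pairs_r), decreasing = TRUE)]
strongest <- names(ranked)[1]

print(round(ranked, 2))
```

Among the pairs that involve quality, alcohol has the largest coefficient (about 0.48).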

Multivariate Plots Section

In the above plot of alcohol vs density vs quality, we can see that wines rated 5 in quality tend to be denser while having a lower alcohol content.

No significant observations can be derived from this plot.

There are no interesting patterns here.

Clearly, acidity decreases as pH increases, but quality seems to be spread uniformly across the plot.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the first graph, density appears to have an inverse relationship with quality: the denser the wine, the lower its score.

Were there any interesting or surprising interactions between features?

No.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

There seem to be spikes in the citric acid distribution instead of the expected normal shape.

Plot Two

Description Two

This boxplot shows how the alcohol level varies with quality, with a dip at quality ‘5’.

Plot Three

Description Three

This plot shows how residual sugar varies with alcohol. Even though the alcohol levels vary widely with sugar, there is a clear preference for wines with lower amounts of residual sugar.

Reflection

The takeaways from this analysis are that high-quality wines tend to have higher alcohol content and low residual sugar. Another interesting finding was that citric acid content decreases as pH increases. So, wines with lower pH (that is, higher acidity) have higher citric acid content.

In conclusion, if you are looking for a good bottle of wine, it will most likely have very little sweetness to it, but it will be strong.